XClusters: Explainability-first Clustering
We study the problem of explainability-first clustering where explainability
becomes a first-class citizen for clustering. Previous clustering approaches
use decision trees for explanation, but only after the clustering is completed.
In contrast, our approach is to perform clustering and decision tree training
holistically where the decision tree's performance and size also influence the
clustering results. We assume the attributes for clustering and explaining are
distinct, although this is not necessary. We observe that our problem is a
monotonic optimization where the objective function is a difference of
monotonic functions. We then propose an efficient branch-and-bound algorithm
for finding the best parameters that lead to a balance of cluster distortion
and decision tree explainability. Our experiments show that our method can
improve the explainability of any clustering that fits in our framework.
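To make the holistic objective concrete, here is a minimal sketch in Python, assuming synthetic data, sklearn's KMeans and DecisionTreeClassifier, and a plain grid search in place of the paper's branch-and-bound; the score function, the weight lam, and the penalty terms are illustrative choices, not the paper's exact formulation.

```python
# Minimal sketch: pick the clustering whose companion decision tree is
# both small and faithful, so explainability influences the clustering
# choice. Grid search stands in for the paper's branch-and-bound.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def holistic_score(X_cluster, X_explain, k, max_leaves, lam=0.1):
    """Lower is better: distortion + lam * tree size + tree infidelity."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_cluster)
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaves, random_state=0)
    tree.fit(X_explain, km.labels_)
    fidelity = tree.score(X_explain, km.labels_)  # how well the tree explains
    distortion = km.inertia_ / len(X_cluster)
    return distortion + lam * tree.get_n_leaves() + (1.0 - fidelity)

# Distinct attribute sets for clustering vs. explaining, as in the paper.
rng = np.random.default_rng(0)
X_c, X_e = rng.normal(size=(200, 4)), rng.normal(size=(200, 3))
best = min((holistic_score(X_c, X_e, k, m), k, m)
           for k in range(2, 6) for m in (4, 8, 16))
print("score, k, max_leaves:", best)
```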
MixRL: Data Mixing Augmentation for Regression using Reinforcement Learning
Data augmentation is becoming essential for improving regression accuracy in
critical applications including manufacturing and finance. Existing techniques
for data augmentation largely focus on classification tasks and do not readily
apply to regression tasks. In particular, the recent Mixup techniques for
classification rely on the key assumption that linearity holds among training
examples, which is reasonable if the label space is discrete, but has
limitations when the label space is continuous as in regression. We show that
mixing examples that have either a large data distance or a large label
distance can have an increasingly negative effect on model performance. Hence,
for regression we use the stricter assumption that linearity only holds within
certain data or label distances, where these distances may vary per example. We then propose MixRL, a
data augmentation meta-learning framework for regression that learns, for each
example, how many nearest neighbors it should be mixed with for the best model
performance using a small validation set. MixRL achieves these objectives using
Monte Carlo policy gradient reinforcement learning. Our experiments conducted
both on synthetic and real datasets show that MixRL significantly outperforms
state-of-the-art data augmentation baselines. MixRL can also be integrated with
other classification Mixup techniques for better results.
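As a rough illustration of that locality assumption, the sketch below mixes each example only with one of its K nearest neighbors; K is fixed here, whereas MixRL learns it per example with policy gradients, and nn_mixup and all constants are our own illustrative choices.

```python
# Minimal sketch of distance-limited mixup for regression: each example
# is mixed only with one of its K nearest neighbors, so the linearity
# assumption is applied locally rather than across the whole dataset.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_mixup(X, y, K=5, alpha=0.2, seed=0):
    rng = np.random.default_rng(seed)
    _, idx = NearestNeighbors(n_neighbors=K + 1).fit(X).kneighbors(X)
    # Column 0 is the point itself; pick a random true neighbor.
    j = idx[np.arange(len(X)), rng.integers(1, K + 1, size=len(X))]
    lam = rng.beta(alpha, alpha, size=len(X))[:, None]
    return lam * X + (1 - lam) * X[j], lam[:, 0] * y + (1 - lam[:, 0]) * y[j]

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X.sum(axis=1) + 0.1 * rng.normal(size=100)
X_aug, y_aug = nn_mixup(X, y)  # augmented pairs for regression training
```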
Personalized DP-SGD using Sampling Mechanisms
Personalized privacy becomes critical in deep learning for Trustworthy AI.
While Differentially Private Stochastic Gradient Descent (DP-SGD) is widely
used in deep learning methods supporting privacy, it provides the same level of
privacy to all individuals, which may lead to overprotection and low utility.
In practice, different users may require different privacy levels, and the
model can be improved by using more information about the users with lower
privacy requirements. There are also recent works on differential privacy of
individuals when using DP-SGD, but they are mostly about individual privacy
accounting and do not focus on satisfying different privacy levels. We thus
extend DP-SGD to support a recent privacy notion called
(Φ,Δ)-Personalized Differential Privacy ((Φ,Δ)-PDP),
which extends an existing PDP concept called Φ-PDP. Our algorithm uses a
multi-round personalized sampling mechanism and embeds it within the DP-SGD
iterations. Experiments on real datasets show that our algorithm outperforms
DP-SGD and simple combinations of DP-SGD with existing PDP mechanisms in terms
of model performance and efficiency due to its embedded sampling mechanism.
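A rough sketch of the embedded sampling idea follows, using a plain squared-loss model in numpy; the per-user budgets, the sampling rule (rate scaled by each user's epsilon), and the clipping and noise constants are illustrative assumptions, not the paper's calibrated mechanism or a formal privacy guarantee.

```python
# Minimal sketch: users with larger privacy budgets are sampled more
# often in each DP-SGD round, so their data contributes more signal.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)
eps = rng.choice([0.5, 1.0, 4.0], size=n)  # illustrative per-user budgets

w = np.zeros(d)
C, sigma, lr, q_base = 1.0, 1.0, 0.1, 0.05
for _ in range(200):
    # Personalized Poisson sampling: rate grows with the user's budget.
    q = np.minimum(1.0, q_base * eps / eps.min())
    batch = np.nonzero(rng.random(n) < q)[0]
    grads = (X[batch] @ w - y[batch])[:, None] * X[batch]  # per-example grads
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads *= np.minimum(1.0, C / np.maximum(norms, 1e-12))  # clip to norm C
    noisy_sum = grads.sum(axis=0) + rng.normal(scale=sigma * C, size=d)
    w -= lr * noisy_sum / max(len(batch), 1)
print("MSE:", np.mean((X @ w - y) ** 2))
```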
Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective
Data-centric AI is at the center of a fundamental shift in software
engineering where machine learning becomes the new software, powered by big
data and computing infrastructure. Software engineering thus needs to be
rethought so that data becomes a first-class citizen on par with code. One
striking observation is that a significant portion of the machine learning
process is spent on data preparation. Without good data, even the best machine
learning algorithms cannot perform well. As a result, data-centric AI practices
are now becoming mainstream. Unfortunately, many datasets in the real world are
small, dirty, biased, and even poisoned. In this survey, we study the research
landscape for data collection and data quality primarily for deep learning
applications. Data collection is important because recent deep learning
approaches require less feature engineering but larger amounts of
data. For data quality, we study data validation,
cleaning, and integration techniques. Even if the data cannot be fully cleaned,
we can still cope with imperfect data during model training using robust model
training techniques. In addition, while bias and fairness have been less
studied in traditional data management research, these issues become essential
topics in modern machine learning applications. We thus study fairness measures
and unfairness mitigation techniques that can be applied before, during, or
after model training. We believe that the data management community is well
poised to solve these problems.
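For a concrete taste of the fairness measures the survey covers, the sketch below computes two standard group-fairness gaps on synthetic predictions; the binary group encoding and the function names are our own.

```python
# Minimal sketch of two common group-fairness measures.
import numpy as np

def demographic_parity_gap(y_pred, group):
    """|P(y_hat = 1 | g = 0) - P(y_hat = 1 | g = 1)|."""
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equal_opportunity_gap(y_true, y_pred, group):
    """Difference in true-positive rates between the two groups."""
    tpr = [y_pred[(group == g) & (y_true == 1)].mean() for g in (0, 1)]
    return abs(tpr[0] - tpr[1])

rng = np.random.default_rng(0)
g = rng.integers(0, 2, size=500)        # synthetic group membership
y = rng.integers(0, 2, size=500)        # synthetic ground truth
y_hat = rng.integers(0, 2, size=500)    # synthetic binary predictions
print(demographic_parity_gap(y_hat, g), equal_opportunity_gap(y, y_hat, g))
```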
Inspector Gadget: A Data Programming-based Labeling System for Industrial Images
As machine learning for images becomes democratized in the Software 2.0 era,
one of the serious bottlenecks is securing enough labeled data for training.
This problem is especially critical in a manufacturing setting where smart
factories rely on machine learning for product quality control by analyzing
industrial images. Such images are typically large and may only need to be
partially analyzed where only a small portion is problematic (e.g., identifying
defects on a surface). Since manually labeling these images is expensive, weak
supervision is an attractive alternative where the idea is to generate weak
labels that are not perfect, but can be produced at scale. Data programming is
a recent paradigm in this category where it uses human knowledge in the form of
labeling functions and combines them into a generative model. Data programming
has been successful in applications based on text or structured data, and can
also be applied to images, usually by first converting them into structured
data. In this work, we expand the horizon of data programming by directly
applying it to images without such a conversion, which is a common
scenario in industrial applications. We propose Inspector Gadget, an image
scenario for industrial applications. We propose Inspector Gadget, an image
labeling system that combines crowdsourcing, data augmentation, and data
programming to produce weak labels at scale for image classification. We
perform experiments on real industrial image datasets and show that Inspector
Gadget obtains better performance than other weak-labeling techniques: Snuba,
GOGGLES, and self-learning baselines using convolutional neural networks (CNNs)
without pre-training.
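To illustrate the data-programming step in isolation, here is a minimal sketch where hypothetical threshold labeling functions vote on a per-image defect score; Inspector Gadget combines labeling-function outputs with a learned generative model (plus crowdsourcing and augmentation) rather than the plain majority vote used here.

```python
# Minimal sketch: noisy labeling functions (LFs) vote on each example,
# abstains (-1) are ignored, and a majority vote yields weak labels.
import numpy as np

def apply_lfs(scores, lfs):
    """Label matrix L[i, j] = lfs[j](scores[i]); -1 means abstain."""
    return np.array([[lf(s) for lf in lfs] for s in scores])

def majority_vote(L, n_classes=2):
    votes = np.stack([(L == c).sum(axis=1) for c in range(n_classes)], axis=1)
    weak = votes.argmax(axis=1)
    weak[votes.sum(axis=1) == 0] = -1  # every LF abstained on this example
    return weak

# Hypothetical LFs over a per-image defect score (1 = defect, 0 = normal).
lfs = [lambda s: 1 if s > 0.8 else -1,
       lambda s: 0 if s < 0.2 else -1,
       lambda s: 1 if s > 0.6 else 0]
scores = np.random.default_rng(0).random(10)
print(majority_vote(apply_lfs(scores, lfs)))
```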